Abstract —This article investigates a data-driven approach for
semantic scene understanding that requires neither pixelwise
annotation nor classifier training. Our framework parses a target
image in two steps: (i) retrieving its exemplars (i.e. references) from an
image database, where all images are unsegmented but annotated 
with tags; (ii) recovering its pixel labels by propagating semantics 
from the references. We present a novel framework making 
the two steps mutually conditional and bootstrapped under the 
probabilistic Expectation-Maximization (EM) formulation. In the 
first step, the references are selected by jointly matching their 
appearances with the target as well as the semantics (i.e. the 
assigned labels of the target and the references). We process 
the second step via a combinatorial graphical representation, in 
which the vertices are superpixels extracted from the target and 
its selected references. Then we derive the potentials of assigning 
labels to one vertex of the target, which depend upon the graph 
edges that connect the vertex to its spatial neighbors of the target 
and to its similar vertices of the references. In addition, the proposed
framework can be naturally applied to perform image annotation
on new test images. In the experiments, we validate our approach
on two public databases, and demonstrate superior performance
over the state-of-the-art methods in both semantic segmentation
and image annotation tasks.

Index Terms —scene understanding, semantic segmentation, 
image retrieval, graphical model, image annotation 
I. Introduction

Significant progress has been made on the task of semantic image
understanding [?]. However, these methods usually build upon
supervised learning with fully annotated data, which are expensive
and sometimes limited in large-scale scenarios [?]. Several weakly
supervised methods have been proposed [?] to reduce the burden of
data annotation; they can be trained with only image-level labels
indicating the classes present in the images. Recently, data-driven
approaches [?] have received increasing attention; they leverage
knowledge from auxiliary data in a weakly supervised fashion and
demonstrate very promising applications. Following this trend, one
interesting but challenging problem arises for scene understanding:
how to parse raw images by exploiting the strength of numerous
unsegmented but tagged images, given that image-level tags are
much easier to obtain. In this work, we investigate this problem by
developing a unified framework, in which the following two steps
are performed iteratively, as Fig. 1 illustrates.

Fig. 1. A glance at our framework, where we semantically segment the
target image in a self-driven fashion: the algorithm iterates to retrieve (c) the
exemplars matching the target from (b) the auxiliary data, and (a) parse
the target image by virtue of the strength of the selected exemplars.

In Step 1, we search the auxiliary database (Fig. 1 (b)) for
images similar to the target, which serve as its exemplars (i.e.
references); these references are required to share similar semantic
concepts with the target. Moreover, we enforce the representation
to be semantically meaningful: the selected references should
contain consistent tags. The tags of the target image can also be
taken into account during the iterations, as they are determined by
the preceding label assignment step (Step 2). We solve this step
using the proximal gradient method.

In Step 2, we assign labels to the pixels of the target by
propagating semantics from the selected references. We create
a graphical model in which the vertices are the superpixels
from the target image and its references. Inspired by [?], we
define two types of edges over the graph:
(i) the inner-edges connecting spatially adjacent vertices
within the target; (ii) the outer-edges connecting the vertices
of the target to those of its references. The potentials are then
derived in an MRF form by aggregating the two types of
edge connections, which can be solved efficiently by the Graph
Cuts algorithm [?].

The two steps above are mutually conditional, providing
complementary information to each other. We present a novel
probabilistic Expectation-Maximization (EM) formulation in which
the two steps bootstrap each other and produce results in
a self-driven manner. In addition, the proposed framework can
be directly applied to new test images to perform multi-label
image annotation. Our approach is evaluated on several
benchmarks, and outperforms other state-of-the-art methods.

II. Related work 

Traditional efforts for scene understanding mainly focused
on capturing scene appearances, structures and spatial contexts
by developing combinatorial models, e.g., CRF [?],
Texton-Forest [?], and Graph Grammar [?]. These models were
generally founded on supervised learning techniques, and
required manually prepared training data containing labels at
the pixel level.

Several weakly supervised methods have been proposed that are
trained with only image-level labels indicating the classes
present in the images. For example, Winn et al. [?] proposed to
learn object classes based on unsupervised image segmentation.
Zhang et al. [?] learned classification models for all scene labels
by selecting representative training samples, and multiple instance
learning was utilized in [?].

Some nonparametric approaches have also been studied
that solve the problem by searching and matching against an
auxiliary image database. For example, an efficient structure-aware
matching algorithm was discussed in [?] to transfer
labels from the database to the target image, but pixelwise
annotation was required for the auxiliary images.

III. Problem Formulation 

In this section, we phrase the problem in a probabilistic 
formulation, and then discuss the Expectation-Maximization 
(EM) inference framework for optimization. 


A. Probability Model 

Let $\Delta = \{I_k, L_k\}_{k=1}^{N}$ denote a set of images $\{I_k\}$ with
image-level labels $\{L_k\}$. Each image $I_k$ is represented as a set
of superpixels $\{x_i^k\}_{i=1}^{n_k}$, where $n_k$ is the number of
superpixels in $I_k$.

Given the target image $I_t$, our task is to predict its image-level
labels $L_t$, as well as to assign each superpixel $x_i^t$ a label
$y_i^t \in L_t$. Let $Y_t$ denote the whole label assignment, i.e.,
$Y_t = \{y_i^t\}_{i=1}^{n_t}$. We define $P(I_t, Y_t | \Delta)$ as the joint
probability distribution of the target image $I_t$ and the label
assignment $Y_t$.

We also define a binary-valued correspondence variable
$\alpha = \{\alpha_k\}_{k=1}^{N}$ such that $\alpha_k = 1$ if image $I_k$ is selected as a
reference for the target image; $\alpha$ is treated as a hidden variable.

The complete probability model is defined as follows,

$$P(I_t, Y_t, \alpha | \Delta) = P(I_t, Y_t | \alpha, \Delta) P(\alpha),$$ (1)

and we further derive it by summing out $\alpha$ as

$$P(I_t, Y_t | \Delta) = \sum_{\alpha} P(I_t, Y_t | \alpha, \Delta) P(\alpha).$$ (2)

Then the optimal label assignment $Y_t^*$ is obtained by maximizing the
probability,

$$Y_t^* = \arg\max_{Y_t} P(I_t, Y_t | \Delta),$$ (3)

and we propose to solve it iteratively under an
Expectation-Maximization (EM) framework.



Fig. 2. Illustration of the semantic-aware sparse coding. Top: The target
image is denoted by the pentagon and each auxiliary image by a
triangle. The dark triangles represent the images selected as the references.
Bottom: The grey squares represent semantic labels that are introduced as
constraints during the optimization. We select a subset of auxiliary images
as references for the target image.


B. The EM Iterations 

It has been shown that estimating $Y_t^*$ from $P(I_t, Y_t | \Delta)$ is
equivalent to minimizing the following energy function [?]:

$$\mathcal{L}(Q, Y_t) = -\sum_{\alpha} Q(\alpha) \ln P(I_t, Y_t, \alpha | \Delta) + \sum_{\alpha} Q(\alpha) \ln Q(\alpha),$$ (4)

where $Q(\alpha)$ is the posterior of the latent variable $\alpha$.

Since the second term in Eq. (4) does not depend on $Y_t$, the
optimization iterates between two steps: (i) the E-step minimizes the
energy $\mathcal{L}(Q, Y_t)$ with respect to $Q(\alpha)$ with $Y_t$ fixed; (ii) the
M-step minimizes the energy $\mathcal{L}(Q, Y_t)$ with respect to $Y_t$ with
$Q(\alpha)$ fixed.

(i) The E-step: approximating $Q(\alpha)$:

The posterior of the latent variable $Q(\alpha)$ is defined as

$$Q(\alpha) = P(\alpha | I_t, Y_t, \Delta) = \frac{1}{Z} \exp\{-E_{\alpha}(\alpha, I_t, Y_t, \Delta)\},$$ (5)

where $Z$ is the normalization constant of the probability. The
energy $E_{\alpha}$ evaluates the appearance and semantic consistency,
which is specified as

$$E_{\alpha}(\alpha, I_t, Y_t, \Delta) = E_{sc}(\alpha, I_t, \Delta) + \gamma E_{sa}(\alpha, Y_t, \Delta).$$ (6)

The first term $E_{sc}$ measures the appearance similarity
between $I_t$ and the images in $\Delta$, defined as

$$E_{sc} = \frac{1}{2} \|F(I_t) - B\alpha\|_2^2 + \beta \|\alpha\|_1,$$ (7)
where $\beta$ is the tradeoff parameter used to balance the sparsity
and the reconstruction error. $F(\cdot)$ is an $m$-dimensional global
feature of an image, and $B \in \mathbb{R}^{m \times N}$ is a matrix consisting of
all the features of the images in $\Delta$.

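The second term $E_{sa}$ in Eq. (6) measures semantic consistency,
defined as

$$E_{sa} = \frac{1}{2} \sum_{i,j \in N} S_{ij} \left\| \frac{\alpha_i}{\sqrt{A_{ii}}} - \frac{\alpha_j}{\sqrt{A_{jj}}} \right\|^2 + \lambda \alpha^\top V \alpha = \alpha^\top C \alpha + \lambda \alpha^\top V \alpha,$$ (8)

where $S_{ij}$ measures the semantic similarity between $I_i, I_j \in \Delta$, as

$$S_{ij} = \frac{|L_i \cap L_j|}{|L_i \cup L_j|},$$ (9)

and $A$ in Eq. (8) is a diagonal matrix where $A_{ii} = \sum_j S_{ij}$,
and $C = A^{-1/2}(A - S)A^{-1/2}$ is the normalized
Laplacian matrix.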

Images with similar semantics should be encoded with similar
activations. In other words, if two images have common
labels, then the distance between the activation codes of this
image pair should be small.

$V$ is a diagonal matrix where $V_{kk}$ measures the semantic
dissimilarity between $I_k \in \Delta$ and the target image $I_t$. Thus the
second term¹ $\lambda \alpha^\top V \alpha$ penalizes reconstructing the target $I_t$
from images that are semantically dissimilar to $I_t$. We define
the diagonal matrix $V$ by

$$V_{kk} = 1 - \frac{|L_k \cap L_t|}{|L_k \cup L_t|},$$ (10)

where $L_t$ are the latent labels of the target image, which are
unknown at the beginning² and can be determined from $Y_t$
during the later iterations.
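The semantic quantities above are simple set computations. The following is a minimal sketch, not the authors' implementation: it computes the Jaccard similarity of Eq. (9) between tag sets, and it assumes the dissimilarity $V_{kk}$ is one minus the same overlap measured against the current target labels. The function names `jaccard` and `semantic_matrices` are hypothetical.

```python
def jaccard(a, b):
    """Overlap |a ∩ b| / |a ∪ b| of two tag sets (0.0 if both are empty)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def semantic_matrices(db_labels, target_labels):
    """Return (S, V_diag) for a list of database tag sets.

    S[i][j] is the pairwise tag similarity between database images.
    V_diag[k] is assumed to be 1 - jaccard(L_k, L_t), i.e. the
    dissimilarity between database image k and the target.
    """
    n = len(db_labels)
    S = [[jaccard(db_labels[i], db_labels[j]) for j in range(n)]
         for i in range(n)]
    V_diag = [1.0 - jaccard(labels, target_labels) for labels in db_labels]
    return S, V_diag


db = [{"grass", "cow"}, {"grass", "sheep"}, {"car", "road"}]
S, V_diag = semantic_matrices(db, target_labels={"grass", "cow"})
```

In this toy database the first image shares all tags with the target, so its penalty weight is zero, while the third shares none and receives the maximum weight of one.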


(ii) The M-step: estimating $Y_t$:

The M-step minimizes the following energy
function with respect to $Y_t$:

$$E_M(Y_t) = -\sum_{\alpha} Q(\alpha) \ln P(I_t, Y_t, \alpha | \Delta).$$ (11)


However, summing out $\alpha$ over all possibilities is computationally
very expensive, particularly for a large number $N$ of auxiliary
images. Instead, we seek a lower bound of $E_M(Y_t)$.
Assume that we can infer $\alpha^*$ with the maximized probability
$Q(\alpha^*)$ by the E-step. Then we can define the joint distribution
of $(I_t, Y_t)$ conditioned on $\alpha^*$, and we have

$$\sum_{\alpha} P(I_t, Y_t | \Delta; \alpha^*) \geq \sum_{\alpha} P(I_t, Y_t, \alpha | \Delta).$$ (12)

This is straightforward in the context of our task, as the cumulative
density of assigning labels from good references (i.e.
given $\alpha^*$) is higher than that of the general case. Thus, we set
the lower bound as

$$E_M(Y_t) \geq -\sum_{\alpha} Q(\alpha) \ln P(I_t, Y_t | \Delta; \alpha^*),$$ (13)

where $Q(\alpha)$ is fixed by the last E-step. Since $\sum_{\alpha} Q(\alpha) = 1$,
the energy to be minimized can be further simplified as

$$E_M(Y_t) = -\ln P(I_t, Y_t | \Delta, \alpha^*),$$ (14)

where we will specify $-\ln P(I_t, Y_t | \Delta, \alpha^*)$ with a combinatorial
graph model in Sec. IV-B.

¹ $\alpha^\top V \alpha$ is convex, which is convenient for optimization.
² We initialize $L_t$ as the whole label set of the database.


IV. Inference and Implementation 

Within the EM formulation, the inference algorithm iterates 
with two steps: (i) computing a* in the E-step for reference 
retrieval and (ii) solving the optimal labeling Y* with the 
selected references in the M-step. 
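The alternation between the two steps can be sketched as a short driver loop. This is only an illustrative skeleton under stated assumptions: `retrieve` and `assign` are hypothetical callbacks standing in for the E-step (reference retrieval) and the M-step (label assignment), and the loop stops when the set of labels used by the target no longer changes.

```python
def em_parse(retrieve, assign, all_labels, max_iter=10):
    """Alternate reference retrieval and label assignment.

    retrieve(labels) -> references chosen given the current target labels
    assign(references) -> list with one label per target superpixel
    Both callbacks are placeholders for the E-step and M-step.
    """
    labels = set(all_labels)      # start from the whole label set
    assignment = None
    for _ in range(max_iter):
        references = retrieve(labels)      # E-step
        assignment = assign(references)    # M-step
        new_labels = set(assignment)       # labels actually used
        if new_labels == labels:           # label set is stable: stop
            break
        labels = new_labels
    return assignment, labels


# Toy callbacks, purely for illustration.
toy_retrieve = lambda labels: sorted(labels)[:2]
toy_assign = lambda refs: [refs[0], refs[-1], refs[0]]
result, final_labels = em_parse(toy_retrieve, toy_assign,
                                {"cow", "grass", "sky"})
```

The `max_iter` cap is an added safeguard that the text does not mention; it only guards against the label set oscillating.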


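Algorithm 1 Adaptive Reference Retrieval

Input: Target image feature $F(I_t)$, codebook $B$, semantic
constraints $\Lambda$, and the stopping threshold $\sigma$.

Output: Semantic sparse coding coefficient $\alpha^*$.

Initial: Initialize $\alpha^{1}$ randomly, and $k = 1$. Denote
$g(\alpha) = \frac{1}{2}\|F(I_t) - B\alpha\|_2^2 + \frac{\gamma}{2} \alpha^\top \Lambda \alpha$, so Eq. (15) can be
reformulated as $E_{\alpha} = g(\alpha) + \beta\|\alpha\|_1$.

1: while $\|\alpha^{k} - \alpha^{k-1}\|_2 > \sigma$ (always true for $k = 1$) do

2: Compute the gradient of $g(\alpha)$ at $\alpha^{k}$,
$\nabla g(\alpha^{k}) = B^\top (B\alpha^{k} - F(I_t)) + \gamma \Lambda \alpha^{k}$.

3: $z_k = \arg\min_{z} (z - \alpha^{k})^\top \nabla g(\alpha^{k}) + \beta\|z\|_1 + \frac{L}{2}\|z - \alpha^{k}\|_2^2$,
where $L > 0$ is a parameter.

4: Iteratively increase $L$ by a constant factor until
the condition $g(z_k) \leq M_L(\alpha^{k}, z_k) := g(\alpha^{k}) + \nabla g(\alpha^{k})^\top (z_k - \alpha^{k}) + \frac{L}{2}\|z_k - \alpha^{k}\|_2^2$
is met, else return to step 3.

5: Update $\alpha^{k+1} := \alpha^{k} + \nu_k (z_k - \alpha^{k})$, where $\nu_k \in (0, 1]$.

6: $k := k + 1$

7: end while

8: $\alpha^* = \alpha^{k}$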


A. Adaptive Reference Retrieval 

Maximizing $Q(\alpha)$ is equivalent to minimizing the energy
defined in Eq. (6), i.e., $\alpha^* = \arg\min_{\alpha} E_{\alpha}(\alpha, I_t, Y_t, \Delta)$.
Notice that $E_{\alpha}(\alpha, I_t, Y_t, \Delta)$ can be regarded as a
semantic-aware sparse representation, where we jointly model
appearance reconstruction and semantic consistency. Fig. 2
intuitively illustrates this model, and it can be rewritten as

$$E_{\alpha} = \frac{1}{2}\|F(I_t) - B\alpha\|_2^2 + \beta\|\alpha\|_1 + \frac{1}{2}\gamma \alpha^\top \Lambda \alpha,$$ (15)

where $\Lambda = 2(C + \lambda V)$. The semantic terms in Eq. (15)
can be phrased in convex forms, thus we can use the
proximal gradient method to solve this problem efficiently.
The optimization process is shown in Algorithm 1.

Given the optimized $\alpha^*$, we can simply select the references
according to the coding coefficients, e.g., by thresholding,
and we set $\alpha_k = 0$ if image $I_k$ is not selected.
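To make the optimization concrete, here is a minimal pure-Python sketch of a proximal gradient loop with backtracking on the step parameter $L$, for an objective of the form $\frac{1}{2}\|f - B\alpha\|_2^2 + \frac{\gamma}{2}\alpha^\top \Lambda \alpha + \beta\|\alpha\|_1$. It is an illustration under assumptions rather than the authors' code: it starts from zeros instead of a random point, takes the full step ($\nu_k = 1$), assumes $\Lambda$ is symmetric, and all function names are hypothetical.

```python
import math


def matvec(M, v):
    """Multiply a matrix (sequence of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1 applied elementwise."""
    return [math.copysign(max(abs(x) - t, 0.0), x) for x in v]


def smooth_part(B, f, Lam, gamma, a):
    """g(a) = 0.5 * ||f - B a||^2 + 0.5 * gamma * a^T Lam a."""
    r = [bi - fi for bi, fi in zip(matvec(B, a), f)]
    quad = sum(ai * li for ai, li in zip(a, matvec(Lam, a)))
    return 0.5 * sum(x * x for x in r) + 0.5 * gamma * quad


def gradient(B, f, Lam, gamma, a):
    """Gradient of g: B^T (B a - f) + gamma * Lam a (Lam symmetric)."""
    r = [bi - fi for bi, fi in zip(matvec(B, a), f)]
    Bt = list(zip(*B))
    return [x + gamma * y for x, y in zip(matvec(Bt, r), matvec(Lam, a))]


def proximal_gradient(B, f, Lam, beta, gamma,
                      sigma=1e-8, L=1.0, eta=2.0, max_iter=1000):
    a = [0.0] * len(B[0])          # assumption: zero start, not random
    for _ in range(max_iter):
        g_a = smooth_part(B, f, Lam, gamma, a)
        grad = gradient(B, f, Lam, gamma, a)
        while True:
            # Proximal step for the current L.
            z = soft_threshold([ai - gi / L for ai, gi in zip(a, grad)],
                               beta / L)
            d = [zi - ai for zi, ai in zip(z, a)]
            bound = (g_a + sum(gi * di for gi, di in zip(grad, d))
                     + 0.5 * L * sum(di * di for di in d))
            if smooth_part(B, f, Lam, gamma, z) <= bound + 1e-12:
                break              # quadratic upper bound holds
            L *= eta               # otherwise increase L and retry
        step = math.sqrt(sum(di * di for di in d))
        a = z                      # full update (nu_k = 1)
        if step <= sigma:
            break
    return a


# With B = identity and Lam = 0 the minimizer is soft_threshold(f, beta).
B = [[1.0, 0.0], [0.0, 1.0]]
Lam = [[0.0, 0.0], [0.0, 0.0]]
alpha = proximal_gradient(B, [3.0, 0.5], Lam, beta=1.0, gamma=0.2)
```

References could then be chosen by keeping the entries of the returned vector that exceed a threshold, as described above.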
Fig. 3. Illustration of the combinatorial graphical model. The dark circles
represent the superpixels; the four inside the square region are extracted from
the target image, while the others come from references that are denoted by
dashed regions.


B. Aggregated Label Assignment 

Given the references determined by $\alpha^*$, we propagate their
semantic labels to $I_t$ by constructing a combinatorial graph.
We extract superpixels from both $I_t$ and the references as
graph vertices, and connect them with probabilistic edges
incorporating their affinities, as Fig. 3 illustrates.

Two types of edges are considered over the graph: (i) the
inner-edges $\omega$ connecting spatially neighboring superpixels
within the target (red wavy line in Fig. 3), and (ii) the
outer-edges $\xi$ connecting the superpixels of the target to those of its
references (straight green line in Fig. 3). Each superpixel
of the target connects with the $q$ most similar superpixels of
each reference.
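The construction of the outer-edges can be sketched as a nearest-neighbor search per reference. This is a minimal illustration that assumes similarity is measured by Euclidean distance between superpixel feature vectors; the function name `outer_edges` and the dictionary layout of the reference features are hypothetical.

```python
import heapq
import math


def outer_edges(target_feats, reference_feats, q):
    """Link each target superpixel to its q nearest superpixels per reference.

    target_feats: list of feature tuples, one per target superpixel.
    reference_feats: dict mapping a reference image id to its list of
        superpixel feature tuples.
    Returns a list of (target_index, reference_id, reference_index) edges.
    """
    edges = []
    for i, ft in enumerate(target_feats):
        for k, feats in reference_feats.items():
            # Indices of the q reference superpixels closest to ft.
            nearest = heapq.nsmallest(
                q, range(len(feats)),
                key=lambda j: math.dist(ft, feats[j]))
            edges.extend((i, k, j) for j in nearest)
    return edges


target = [(0.0, 0.0)]
references = {7: [(5.0, 5.0), (0.1, 0.0), (1.0, 1.0)]}
edges = outer_edges(target, references, q=2)
```

A brute-force search like this costs time proportional to the number of target superpixels times the number of reference superpixels, which is acceptable for a handful of references per target.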

We define $-\ln P(I_t, Y_t | \Delta, \alpha^*)$ in Eq. (14) on the graphical
model as follows,

$$-\ln P(I_t, Y_t | \Delta, \alpha^*) = \sum_{x_i^t \in I_t} \psi(y_i^t | \alpha^*, \Delta) + \sum_{(x_i^t, x_j^t) \in \omega} \phi(y_i^t, y_j^t, x_i^t, x_j^t),$$ (16)

where $\omega$ denotes the inner-edges. The optimization of Eq. (16)
becomes a tractable graphical model optimization problem.

To derive the potentials of assigning labels to one vertex of
the target, $\psi(y_i^t | \alpha^*, \Delta)$ in Eq. (16), we propose the
semantic-based superpixel density prior, which is defined as

$$\psi(y_i^t | \alpha^*, \Delta) = \sum_{k=1}^{N} \alpha_k^* \rho(x_i^t, I_k) \delta(y_i^t \in L_k),$$ (17)

where $\rho(x_i^t, I_k)$ denotes the density of superpixel $x_i^t$ in image
$I_k$, which is defined as

$$\rho(x_i^t, I_k) = \frac{1}{|\xi|} \sum_{(x_i^t, x_j^k) \in \xi} \|f(x_i^t) - f(x_j^k)\|_2,$$ (18)

where $\xi$ denotes the outer-edges, $|\xi|$ is the number of outer-edges,
and $f(\cdot)$ is the feature vector of a superpixel. This
density measures the similarity between the superpixel $x_i^t$ in
the target and its neighboring superpixels connected by outer-edges
in the reference image $I_k$; thus it implicitly exhibits the
probability that $x_i^t$ shares the same labels with its reference
$I_k$.

The pairwise potentials, i.e. $\phi(y_i^t, y_j^t, x_i^t, x_j^t)$ in Eq. (16),
encourage smoothness between neighboring superpixels
within the target, as

$$\phi(y_i^t, y_j^t, x_i^t, x_j^t) = \|f(x_i^t) - f(x_j^t)\|_2 \, \delta(y_i^t \neq y_j^t),$$ (19)

where $\delta(\cdot)$ is the indicator function.

Thus approximate solutions of Eq. (16) can be found using the
alpha-beta swap algorithm of graph cuts. The sketch of our
framework is shown in Algorithm 2.

Algorithm 2 Overall procedure of our framework

Input: Target $I_t = \{x_i^t\}_{i=1}^{n_t}$ and auxiliary $\Delta = \{I_k, L_k\}_{k=1}^{N}$.

Output: Label of each superpixel $Y_t = \{y_i^t\}_{i=1}^{n_t}$.

Initial: $L_t^{1}$ contains all labels, and $n = 1$.

1: while $L_t^{n+1} \neq L_t^{n}$ do

2: Minimize $E_{\alpha}$ defined in Eq. (15) using Alg. 1.

3: Sort $\alpha^*$ in descending order, and select the images
corresponding to the first $p$ nonzero coefficients as a set $\mathcal{P}$.

4: for all $x_i^t$ in $I_t$ do

5: for all images $I_k$ in $\mathcal{P}$ do

6: Select the $q$ most similar superpixels $O_i^k$ in $I_k$.

7: Construct $O_i = \cup_k O_i^k$.

8: end for

9: Add $(x_i^t, x_j^k)$ to $\xi$ for all $x_j^k \in O_i$.

10: Add $(x_i^t, x_j^t)$ to $\omega$ for all neighbors $x_j^t$ of $x_i^t$, $i \neq j$.

11: end for

12: Minimize Eq. (16). Optimize the latent labels $Y_t^*$ using
the alpha-beta swap algorithm of graph cuts.

13: Update $L_t^{n+1}$ as the unique set of $Y_t^*$.

14: $n := n + 1$

15: end while


C. Image Annotation 

We propose a simple method to transfer $n$ labels to a
test image $I_t$ from the query's $K$ nearest neighbors in the
training set. For a given test image $I_t$, the sparse reconstruction
coefficient vector $\alpha$ is determined by solving the problem
in Eq. (15), where we set $\lambda = 0$, and set the other parameters
the same as described in Section V-B1. Denote the optimal sparse
coefficient solution by $\hat{\alpha}$, and let its $K$ largest values
be denoted by $\pi \in \mathbb{R}^{K}$, corresponding to the image label
indicator vectors $l_i$, $i = 1, 2, \ldots, K$. The label probability
vector of the test image can then be obtained as

$$z_t = \sum_{i=1}^{K} \pi_i l_i,$$ (20)

where $\pi_i$ is the $i$-th component of the vector $\pi$. The labels
corresponding to the top few largest values in $z_t$ are considered
as the final annotations of the test image.

We compare the following two annotation methods, and find
that the sparse coefficient $\alpha$ is extremely useful for image
annotation: (i) weighed: the annotation is weighed by the
sparse reconstruction coefficients $\pi_i$; (ii) unweighed: we manually
set $\pi_i = 1$, $i = 1, \ldots, K$.
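The label transfer step can be illustrated with a few lines of code. The sketch below is a hedged reading of the procedure, assuming the label score vector is the coefficient-weighted sum of the binary label indicator vectors of the $K$ images with the largest coefficients; the function name `transfer_labels` and its arguments are hypothetical. Setting `weighted=False` gives the unweighed variant.

```python
def transfer_labels(alpha, label_indicators, K, n_labels, weighted=True):
    """Score labels from the K training images with largest coefficients.

    alpha: sparse coefficients, one per training image.
    label_indicators: one 0/1 list per training image (1 = label present).
    Returns (scores, indices of the n_labels highest-scoring labels).
    """
    # Indices of the K largest coefficients.
    top = sorted(range(len(alpha)), key=lambda i: alpha[i], reverse=True)[:K]
    num_classes = len(label_indicators[0])
    scores = [0.0] * num_classes
    for i in top:
        weight = alpha[i] if weighted else 1.0
        for c, present in enumerate(label_indicators[i]):
            scores[c] += weight * present
    ranked = sorted(range(num_classes), key=lambda c: scores[c], reverse=True)
    return scores, ranked[:n_labels]


alpha = [0.6, 0.0, 0.3, 0.1]
indicators = [[1, 0, 1], [0, 1, 0], [1, 1, 0], [0, 1, 0]]
scores_w, labels_w = transfer_labels(alpha, indicators, K=2, n_labels=2)
scores_u, labels_u = transfer_labels(alpha, indicators, K=2, n_labels=2,
                                     weighted=False)
```

In the weighted case the third label outranks the second because it comes from the image with the larger coefficient, whereas the unweighed scores cannot separate them.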

Compared with classical works on image annotation, the proposed
method has the following characteristics: (i) the propagation
process is robust and less sensitive to image noise, owing to the
semantic constraints in the image retrieval step; (ii) the proposed
algorithm is scalable to large-scale data, and retrieves images by
jointly matching their appearance as well as their semantics.

V. Experiment 

In this section, we conduct extensive experiments to validate 
the performance of our method and discuss the experimental 
analysis. We also conduct an empirical study on the effective¬ 
ness of the proposed EM iterations. 

Implementation details: Five parameters are required to be
set in our framework. We set $q = 20$ to construct the $q$-nearest
graph, and set $p = 10$ to retrieve 10 images as references
for each test image. In the experiments we also set $\lambda = 1$
empirically. The other parameters $\beta$ and $\gamma$ are discussed in
Sec. V-B1.



Fig. 6. Decrease of the energy $E_{\alpha}$ over the iterations. The x-axis
indicates the number of iterations, and the y-axis shows the energy $E_{\alpha}$ of
Eq. (15). The results are randomly selected from the test set.


A. Datasets 

To verify the effectiveness of our method, we conduct
experiments on two challenging datasets, i.e. MSRC [?] and
VOC 2007 [?], comparing with state-of-the-art methods. We use
the standard average per-class measure (average accuracy) to
evaluate the performance. For each test image, we use the
training set as the auxiliary data for our framework.
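For reference, the average per-class measure can be computed as follows. This is a minimal sketch assuming that accuracy is counted over labeled units (pixels or superpixels) and that classes absent from the ground truth are skipped; the function name is hypothetical.

```python
def average_per_class_accuracy(truth, pred):
    """Return (per-class accuracy dict, mean over classes in the truth)."""
    correct, total = {}, {}
    for t, p in zip(truth, pred):
        total[t] = total.get(t, 0) + 1
        correct[t] = correct.get(t, 0) + (1 if t == p else 0)
    per_class = {c: correct[c] / total[c] for c in total}
    return per_class, sum(per_class.values()) / len(per_class)


truth = ["grass"] * 4 + ["cow"] * 2
pred = ["grass", "grass", "grass", "cow", "cow", "grass"]
per_class, mean_acc = average_per_class_accuracy(truth, pred)
```

Unlike overall accuracy, this measure gives a rare class the same weight as a frequent one: here five of six units would be 0.667 overall if counted naively as four correct, while the per-class mean is 0.625.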


B. Exp-I: Image Semantic Segmentation 

1) Parameter Analysis: Specifically, we focus on the effects
of $\beta$ and $\gamma$, which weight the sparsity term and the semantic
term in Eq. (15); these two parameters are
crucial to our results. The ranges of $\beta$ and $\gamma$ are both set to
{0, 0.05, 0.10, 0.15, 0.20, 0.25, 0.30}. The semantic segmentation
performance is used to tune the parameters.

We used the MSRC dataset to tune the parameters. The
results of changing the parameter values are presented in Fig. 7,
from which we can draw the following conclusions:

• When $\beta$ and $\gamma$ increase from small values to large values,
the performance varies noticeably, which shows that the
sparse term and the semantic constraint term have great
impact on the performance.

• Mean average precision (MAP) reaches its peak
(0.71) when $\beta = 0.1$ and $\gamma = 0.2$ on MSRC, which
lie in the middle of the range, and the precision does not increase
monotonically as $\beta$ and $\gamma$ increase. In the following
experiments, we adopt the best parameter settings on all
datasets.



Fig. 7. Parameter tuning results of parameters $\beta$ and $\gamma$ for the MSRC dataset.


2) Experiments on MSRC dataset: Given this insight, we
compare the proposed method with the following state-of-the-art
algorithms: MIM [?] and K. Zh [17].

Table I shows that our algorithm outperforms the others.
Benefiting from the semantic constraints incorporated in our
approach, we achieve significant improvements for certain
difficult classes, e.g., chair and cat. Several visualized results
with the corresponding ground truths are presented in Fig.
4(a), and more semantic segmentation results are in the
supplementary material due to the limited space of the article.

3) Experiments on VOC 2007 dataset: Few results
on the VOC 2007 dataset are reported, due to the 20 extremely
challenging categories it contains. Here we compare with the
weakly supervised STF [?] by running the code provided by
the author. We also compare our method with [17]. Results
are reported in Table I, and our method outperforms [17] by
3%.


It takes about 8 seconds per image with an unoptimized
MATLAB implementation for semantic segmentation, on a 64-bit
system with Core-4 3.6 GHz CPU and 4 GB memory (extracting
features: 1 s; sparse coding with semantic constraints: 5 s;
optimization by Graph Cuts: 2 s).

Moreover, we validate the effectiveness of the proposed EM
iterations from two aspects. First, we plot the energy $E_{\alpha}$ in
each iteration, which is the energy of semantic-aware sparse
coding defined in Eq. (15), as shown in Fig. 6. We also present
some intermediate results during the EM iterations³ in Fig. 4(b).

³ Generally, the iteration is complete after two or three steps, since the
average number of labels for each image is 3 in the MSRC or VOC 2007 dataset.
Fig. 4. Some final results (a) and some intermediate results (b) of semantic segmentation on the MSRC dataset. The original image and its ground truth are
shown on the left, and the semantic segmentation result by our method is on the right. Best viewed in color.


MSRC

Class         MIM [?]   K. Zh [17]   Ours
building      12        63           45
grass         83        93           73
tree          70        92           65
cow           81        62           79
sheep         93        75           81
sky           84        78           66
airplane      91        79           71
water         55        64           87
face          97        95           75
car           87        79           84
bicycle       92        93           73
flower        82        62           73
sign          69        76           94
bird          51        32           51
book          61        95           89
chair         59        48           85
road          66        83           42
cat           53        63           83
dog           44        38           81
body          9         68           66
boat          58        15           32
average       67        69           71

VOC 2007

Class         Shotton, weakly [?]   K. Zh [17]   Ours
aeroplane     14                    48           68
bicycle       8                     20           14
bird          11                    26           12
boat          0                     25           16
bottle        17                    3            4
bus           46                    7            27
car           5                     23           18
cat           13                    13           12
chair         4                     38           28
cow           0                     19           16
diningtable   30                    15           7
dog           29                    39           46
horse         12                    17           36
motorbike     18                    18           11
person        40                    25           78
pottedplant   6                     47           18
sheep         17                    9            29
sofa          17                    41           11
train         14                    17           47
tvmonitor     9                     33           41
average       16                    24           27

TABLE I

Accuracies (%) of our method for each category on the MSRC and VOC 2007 datasets, in comparison with other algorithms. The last
row is the average accuracy over all categories.
[Figure: example images with their predicted tags. MSRC: "grass, sheep, road"; "face, book, body"; "grass, sky, aeroplane"; "grass, face, body"; "building, bicycle, road"; "building, car, road"; "grass, cow, water"; "building, tree, sky". VOC 2007: "sheep, person"; "horse, person"; "train, person"; "diningtable, person"; "boat, person"; "motorbike, person"; "bottle, person".]

Fig. 5. Some example results on image annotation from the MSRC (left) and VOC 2007 dataset (right).


The intermediate results shown in Fig. 4(b) empirically support the
effectiveness of the iterations.

C. Exp-II: Image Annotation on Test Images

1) Benchmarks and Metrics: Three popular algorithms are
implemented as baselines for the image annotation
task: MAHR [?], MLkNN [?], and ML-LOC [?].

MLkNN and ML-LOC are state-of-the-art multi-label
annotation algorithms in the literature. They have been reported to
outperform most other multi-label annotation algorithms, such
as RankSVM [?]. Thus, we do not further implement
the latter in this work. We evaluate and compare
the three algorithms on two datasets, MSRC and VOC 2007,
each of which is randomly and evenly split into training and
testing subsets. The image annotation performance is measured
by mean average precision, which is widely used for evaluating
the performance of ranking-related tasks.
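As a reminder of how the metric behaves, the following sketch computes average precision for one ranked list and averages it over several lists. Whether the lists are formed per image or per label is not specified here, so the example assumes one ranked label list per test image; the function names are hypothetical.

```python
def average_precision(scores, relevant):
    """AP of a ranking induced by scores; relevant is a set of item indices."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits, total = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if i in relevant:
            hits += 1
            total += hits / rank   # precision at this relevant item
    return total / len(relevant) if relevant else 0.0


def mean_average_precision(all_scores, all_relevant):
    """Mean of the per-list average precisions."""
    aps = [average_precision(s, r) for s, r in zip(all_scores, all_relevant)]
    return sum(aps) / len(aps)


ap = average_precision([0.9, 0.1, 0.8, 0.4], {0, 3})
mean_ap = mean_average_precision([[0.9, 0.1, 0.8, 0.4], [0.2, 0.7]],
                                 [{0, 3}, {1}])
```

In the first list the relevant labels land at ranks 1 and 3, giving precisions 1 and 2/3, so the average precision is 5/6.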

2) Results and Analysis: The weighed method outperforms
the unweighed one, as Table II shows. This indicates that the
sparse coefficient $\alpha$ is useful for improving image annotation
performance, and likewise for image semantic segmentation,
as we perform image retrieval by jointly matching
appearance as well as semantics. A larger $\pi_i$ means
greater semantic similarity between the test image and
image $I_i$ (i.e. more shared labels).

The proposed weighed method outperforms the three classical
methods listed in Table II. Some example image annotation
results from the MSRC and VOC 2007 datasets are shown in
Fig. 5. Here we only display the top 3 or 2 labels for MSRC
and VOC 2007, since the average number of labels per
image in MSRC and VOC 2007 is 3 and 2, respectively.


Dataset     MAHR   MLkNN   ML-LOC   unweighed   weighed
MSRC        49.5   70.8    77.3     76.1        84.7
VOC 2007    34.0   47.6    48.9     45.8        57.5

TABLE II

Image label annotation MAP (Mean Average Precision)
comparisons on two different datasets.


VI. Conclusions 

This paper proposes a new framework for data-driven
semantic image segmentation where only image-level labels
are available; it is also useful for image annotation.
Compared with traditional supervised learning methods,
our framework is more flexible for real applications such as
online image retrieval. In the experiments, we demonstrate
very promising results on the standard benchmarks of scene
understanding. In future work, we can improve the algorithm's
efficiency by utilizing a parallel implementation and validate our
approach on larger-scale datasets.